Abstract - To build modern robots that not only perform tasks 
but also behave like human beings when interacting with their 
natural environment, it is essential to give them knowledge of 
the emotions underlying human spoken utterances, so that their 
behaviour is consistent and complete. To this end, robots must 
be able to understand and identify human emotions. For this 
reason, the emotional content of speech now receives 
considerable attention, and speech emotion recognition engines 
have been proposed accordingly. This paper is a survey of the 
main aspects of speech emotion recognition, namely feature 
extraction and the types of features commonly used, selection 
of the most informative features from the original feature 
set, and classification of the features using different 
classification techniques, together with information on the 
databases commonly used for speech emotion recognition. 

Index terms - S.E.R (speech emotion recognition), utterance, 
elicit. 

I. Introduction 

Speech emotion recognition refers to the process of 
recognizing the emotional state of a human being using 
speech signals only. Various kinds of information carried by 
the speech signal, such as acoustic, prosodic and sometimes 
linguistic information, are used either separately or in 
combination to carry out this task. Specific features that are 
useful for detecting and recognizing the emotions of the 
speaker are extracted from these types of information. One 
application of speech emotion recognition is detecting the 
frustration of the customer in ticket reservation systems, so 
that the system can change its response accordingly [1]. Speech 
emotion recognition systems have also proved to be very 
helpful in call centres [2], where they are employed to 
recognize the emotional state of the customer so that the 
system or, for that matter, the caller can adapt. Another 
popular application of speech emotion recognition is in 
aircraft, where the advanced pilot control system periodically 
monitors the emotional state of the pilot [3]. 

Emotion recognition through speech is a three-step 
process. The first step is feature extraction, the second is 
feature selection, and the last is classification of the 
utterance into one of the predefined emotional classes. 

This paper is divided into 9 sections. Section 2 provides 
the scientific as well as the linguistic meaning of an 
utterance. Section 3 gives a detailed explanation of emotions 
and their representation in emotional space. Section 4 provides 
detailed information on some popular databases used for 
speech emotion recognition. Section 5 deals with the features 
used for emotion classification. Section 6 provides a brief 
overview of feature selection algorithms. Finally, Section 7 
provides information on different classifiers. 

II. Utterance 

An utterance is a natural unit of speech bounded by 
breaths or pauses of the speaker. Linguists sometimes refer 
to an utterance simply as a unit of speech under study. The act 
of uttering, or the vocal act of expression, is also termed an 
utterance. We use the term utterance to refer to complete 
communicative units, which may consist of a single word, 
phrase or clause. In speech emotion recognition, features are 
normally extracted either from frames, obtained by dividing an 
utterance into voiced or unvoiced intervals, or from the 
complete utterance [4], [5]. 
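
Frame-level extraction can be illustrated with a minimal sketch. 
The function name `split_into_frames` and the `frame_length` and 
`hop_length` parameters are assumptions made for illustration and 
do not come from the surveyed work. The sketch also uses 
fixed-length overlapping frames rather than the voiced/unvoiced 
segmentation mentioned above, since the latter requires a voicing 
detector. 

```python
def split_into_frames(samples, frame_length, hop_length):
    """Split a sequence of samples into overlapping fixed-length frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    if frame_length <= 0 or hop_length <= 0:
        raise ValueError("frame_length and hop_length must be positive")
    frames = []
    start = 0
    while start + frame_length <= len(samples):
        frames.append(list(samples[start:start + frame_length]))
        start += hop_length
    return frames


# Toy signal of 10 samples; frames of 4 samples with a hop of 2 samples.
signal = list(range(10))
frames = split_into_frames(signal, frame_length=4, hop_length=2)
```

Features computed on each element of `frames` are frame-level 
features; features computed once on `signal` are utterance-level. 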

III. Emotion 

Although there is little consensus on what defines emotion, 
the term generally refers to consciously experienced states 
that can be characterized primarily by biological changes, 
physiological expressions and mental states. For research 
purposes, emotion is defined in a two- or three-dimensional 
space. The space of emotions consists of the dimensions of 
activation and valence. Thus, any emotion can be plotted as a 
point or small area in this space. Activation refers to the 
energy spent on, or associated with, a particular emotion. For 
example, anger and happiness are associated with a large 
amount of energy, while sadness and neutrality are associated 
with a low amount of energy. The second dimension of the 
emotional space is valence. 

Valence provides the information regarding the positivity 
or negativity associated with the particular emotion. For 
example, sadness is associated with a lot of negativity while 
happiness is associated with a lot of positivity. 
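
As a rough sketch of the two-dimensional representation, each 
emotion can be stored as a (valence, activation) point and an 
observed point can be mapped to the closest emotion. The 
coordinates below are illustrative assumptions chosen only to 
respect the ordering described above (anger and happiness high 
in activation, sadness and neutrality low, sadness negative and 
happiness positive in valence). They are not measured values. 

```python
import math

# Assumed (valence, activation) coordinates in [-1, 1]; illustrative only.
EMOTION_SPACE = {
    "anger": (-0.7, 0.8),
    "happiness": (0.8, 0.7),
    "sadness": (-0.7, -0.6),
    "neutral": (0.0, -0.3),
}


def nearest_emotion(valence, activation):
    """Return the emotion whose point is closest in Euclidean distance."""
    point = (valence, activation)
    return min(EMOTION_SPACE, key=lambda name: math.dist(EMOTION_SPACE[name], point))


label = nearest_emotion(0.6, 0.9)
```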

Arousal of the sympathetic nervous system is directly 
associated with emotions. With emotions such as joy, anger 
and fear, a large amount of energy is related to arousal of the 
sympathetic nervous system, which in turn results in 
decreased salivation, high blood pressure, increased heart 
rate, etc. This produces speech that is loud and fast, with 
high energy content at higher frequencies. On the other hand, 
emotions like sadness are associated with low blood pressure, 
decreased heart rate and increased salivation, which produces 
speech with low energy content at higher frequencies [6]. 
Figure 1: Block diagram for speech emotion recognition 

IV. Database 

The percentage of emotions that the S.E.R engine recognizes 
successfully in utterances depends on the naturalness of the 
database used for the task. The databases commonly used for 
emotion recognition are mostly acted or elicited ones. 
Databases created in studios or with the help of professionals 
are mainly used [5]. Such databases are termed artificial 
databases, and there are two main kinds. The first is the 
simulated or acted speech database, in which professional 
actors utter the emotional speech sentences. Normally, several 
professional actors utter the same sentence with different 
emotions. This reflects real-life situations, where the content 
of the spoken words sometimes does not matter; instead, human 
beings interpret the meaning of a dialogue in the context of 
its underlying emotion. The second is the elicited speech 
database, which is created from recordings of non-professional 
speakers trying to utter sentences with emotion, for example 
when people try to reproduce dialogues spoken by film stars in 
films. 

There are several well-known emotional databases available, 
but only some of them are freely accessible. Several problems 
exist in the available emotional speech databases, such as: 
1. Different databases are composed of utterances of some 
particular types of emotions only. Therefore, a universal 
database does not yet exist. We must either use the same 
database for training and testing, or use only the utterances 
with emotions common to all the databases. Universality 
in training and testing is lost. 

2. The quality of database recordings of all the databases is 
not the same, which could affect the performance of the 
S.E.R engine. 

3. Descriptions of linguistic and phonetic information are 
not provided in some databases, which poses problems when 
hybrid mechanisms are used for emotion recognition [9]. 

Some commonly used databases for speech emotion 
recognition are the Danish Emotional Speech database [7], the 
Berlin emotional speech database [8], eNTERFACE [9], 
SUSAS [10] and KISMET [22]. 

V. Features For Speech Emotion Recognition 

A. Statistical Features 

The features used are the maximum and minimum values and 
their respective relative positions, range (max - min), 
arithmetic mean and quadratic mean, zero crossing and mean 
crossing rate, arithmetic mean of peaks, centroid of contour, 
number of peaks and mean distance between peaks, linear 
regression coefficients and corresponding approximation error, 
quadratic regression error, standard deviation, variance, 
skewness and kurtosis, and the arithmetic, quadratic and 
geometric means of absolute and non-zero values. 
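
A few of the statistics listed above can be computed over a 
feature contour (for example a per-frame pitch or energy track) 
as in the following sketch. The function name 
`contour_statistics`, the dictionary keys and the toy contour are 
assumptions for illustration. Relative positions are expressed 
here as a fraction of the contour length, which is one possible 
convention. 

```python
import math
import statistics


def contour_statistics(contour):
    """Compute a handful of utterance-level statistics of a contour."""
    n = len(contour)
    if n < 2:
        raise ValueError("contour must contain at least two values")
    max_value = max(contour)
    min_value = min(contour)
    mean = statistics.fmean(contour)
    centered = [v - mean for v in contour]
    # A mean crossing occurs when consecutive values lie on opposite
    # sides of the arithmetic mean.
    crossings = sum(
        1 for a, b in zip(centered, centered[1:]) if (a >= 0) != (b >= 0)
    )
    return {
        "max": max_value,
        "min": min_value,
        "max_position": contour.index(max_value) / (n - 1),
        "min_position": contour.index(min_value) / (n - 1),
        "range": max_value - min_value,
        "arithmetic_mean": mean,
        "quadratic_mean": math.sqrt(sum(v * v for v in contour) / n),
        "std": statistics.pstdev(contour),
        "mean_crossing_rate": crossings / (n - 1),
    }


stats = contour_statistics([1.0, 4.0, 2.0, 6.0, 2.0])
```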

B. Low Level Descriptors 

Energy - (log energy), Pitch - fundamental frequency (F0), 
Zero crossing rate, Probability of voicing, Spectral energy - 
F bands. 
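
Two of these descriptors, log energy and zero crossing rate, 
are simple enough to sketch directly. The function names and the 
`floor` parameter (which avoids taking the logarithm of zero for 
silent frames) are assumptions for illustration. 

```python
import math


def log_energy(frame, floor=1e-10):
    """Natural log of the sum of squared samples, floored to avoid log(0)."""
    energy = sum(s * s for s in frame)
    return math.log(max(energy, floor))


def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    if len(frame) < 2:
        raise ValueError("frame must contain at least two samples")
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


zcr_alternating = zero_crossing_rate([1.0, -1.0, 1.0, -1.0])
zcr_monotonic = zero_crossing_rate([1.0, 2.0, 3.0, 4.0])
```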

C. MFCC 

Mel-frequency cepstral coefficients, 2nd and 3rd order 
coefficients of MFCC, 2nd and 3rd order derivatives of MFCC. 
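
The sketch below illustrates two ingredients related to MFCC 
features. It assumes the widely used mel-scale formula 
m = 2595 log10(1 + f / 700), which is one common convention and 
is not stated in this survey, and it approximates the time 
derivative of a coefficient sequence by central differences with 
repeated edge frames. The function names are hypothetical. 
Higher-order derivatives are obtained by applying `delta` 
repeatedly. 

```python
import math


def hz_to_mel(frequency_hz):
    """Convert frequency in Hz to mels (assumed 2595*log10(1 + f/700) form)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def delta(frames):
    """First-order time derivative of per-frame coefficient vectors.

    Uses central differences; the first and last frames are repeated
    at the edges.
    """
    n = len(frames)
    result = []
    for t in range(n):
        previous = frames[max(t - 1, 0)]
        following = frames[min(t + 1, n - 1)]
        result.append([(b - a) / 2.0 for a, b in zip(previous, following)])
    return result


# One coefficient per frame, rising linearly over four frames.
coefficients = [[0.0], [1.0], [2.0], [3.0]]
first_derivative = delta(coefficients)
second_derivative = delta(first_derivative)
```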

D. Average of Mean Log Spectrum. 

E. Perceptual Linear Predictive Coefficients. 

F. Nonlinear Teager energy operator Based Features (TEO) 

TEO-FM-VAR (TEO decomposed FM variation), 
TEO-AUTO-ENV (autocorrelation envelope area), 
TEO-CB-AUTO-ENV (critical band based TEO-AUTO-ENV). 
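
The features above build on the discrete Teager energy 
operator, commonly written as 
psi[x(n)] = x(n)^2 - x(n-1) x(n+1). The sketch below implements 
only this basic operator, not the TEO-FM-VAR, TEO-AUTO-ENV or 
TEO-CB-AUTO-ENV features themselves. The function name is an 
assumption. 

```python
import math


def teager_energy(samples):
    """Discrete Teager energy for every sample that has both neighbours."""
    return [
        samples[n] ** 2 - samples[n - 1] * samples[n + 1]
        for n in range(1, len(samples) - 1)
    ]


# For a pure tone A*cos(omega*n), the operator yields the constant
# A^2 * sin(omega)^2, which makes a convenient sanity check.
amplitude, omega = 2.0, 0.3
tone = [amplitude * math.cos(omega * n) for n in range(50)]
energy = teager_energy(tone)
```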

Features for speech emotion recognition can be described 
as information about the emotional content of speech that is 
hidden in the speech signal. Studies have shown that the 
quality and type of the features used for emotion recognition 
from the speech signal have a great impact on the efficiency 
of the SER engine used [12], [13]. 

Selection of the feature set to be used for speech emotion 
recognition also depends upon the type of classifier used for 
the classification task in the emotion recognition through 
speech [14], [15]. 

During the earlier phase of research in the field of emotion 
recognition through speech, the development of new and 
advanced classification systems took precedence over other 
areas of development in this field. In the recent phase, the 
focus has shifted to selecting the optimal feature set, which 
increases the overall rate of successful recognition of the 
emotions in the speech signal [12]. 

Features extracted at the utterance level are normally 
termed global features, while local features are extracted at 
the frame level. Global features have shown better results 
than local features [16], [17], [18]. Global features are few 
in number, so analysing them is faster and more efficient. 
Researchers have also proposed some new features, such as 
excitation source features [31], class-level spectral 
features [14], glottal features [32], new harmony 
features [34], and modulation spectral features [33]. 

VI. Feature Selection 

The feature selection phase currently attracts a lot of 
attention from researchers in the field of speech emotion 
recognition. There are two types of feature selection 
algorithms, namely filter-type and wrapper-type 
algorithms [19], [20], [21]. 
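
The difference between the two families can be sketched as 
follows. A filter-type method scores each feature independently 
of any classifier; the Fisher score used here (between-class over 
within-class scatter) is one common choice and is an assumption, 
not a method named in this survey. A wrapper-type method searches 
over feature subsets using a classifier's performance as the 
criterion; greedy forward selection is shown, with a hypothetical 
`evaluate` callback standing in for, e.g., cross-validated 
accuracy. 

```python
import statistics


def fisher_scores(samples, labels):
    """Filter criterion: per-feature between-class / within-class scatter."""
    classes = sorted(set(labels))
    scores = []
    for j in range(len(samples[0])):
        column = [row[j] for row in samples]
        overall_mean = statistics.fmean(column)
        between = 0.0
        within = 0.0
        for c in classes:
            values = [row[j] for row, y in zip(samples, labels) if y == c]
            class_mean = statistics.fmean(values)
            between += len(values) * (class_mean - overall_mean) ** 2
            within += sum((v - class_mean) ** 2 for v in values)
        scores.append(between / within if within > 0 else float("inf"))
    return scores


def forward_selection(n_features, evaluate, max_features):
    """Wrapper search: greedily add the feature that most improves `evaluate`."""
    selected = []
    best_score = float("-inf")
    while len(selected) < max_features:
        candidates = [j for j in range(n_features) if j not in selected]
        if not candidates:
            break
        score, feature = max((evaluate(selected + [j]), j) for j in candidates)
        if score <= best_score:
            break  # no candidate improves the criterion any further
        selected.append(feature)
        best_score = score
    return selected


# Toy data: feature 0 separates the classes, feature 1 does not.
samples = [[1.0, 5.0], [1.2, 3.0], [3.0, 5.1], [3.2, 2.9]]
labels = ["sad", "sad", "angry", "angry"]
scores = fisher_scores(samples, labels)

# Hypothetical stand-in for classifier accuracy on a feature subset.
chosen = forward_selection(2, lambda subset: 1.0 if 0 in subset else 0.5, 2)
```

Filter methods are cheap because no classifier is trained, 
whereas wrapper methods call `evaluate` once per candidate subset 
and are therefore tied to the classifier used. 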

Several strategies have been formulated as feature 
selection frameworks to obtain the optimum feature subset for 
classification. These frameworks employ strategies such as 
one-vs.-rest and one-vs.-one [12]. 
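
The two strategies differ in how a multi-class emotion problem 
is decomposed into binary sub-problems, for each of which a 
feature subset can then be selected. The sketch below only 
enumerates the sub-problems; the function names and the emotion 
list are assumptions for illustration. 

```python
from itertools import combinations


def one_vs_rest_tasks(classes):
    """One binary task per class: that class against all the others."""
    return [(c, [other for other in classes if other != c]) for c in classes]


def one_vs_one_tasks(classes):
    """One binary task per unordered pair of classes."""
    return list(combinations(classes, 2))


emotions = ["anger", "happiness", "sadness", "neutral"]
ovr = one_vs_rest_tasks(emotions)
ovo = one_vs_one_tasks(emotions)
```

With K classes, one-vs.-rest yields K tasks and one-vs.-one 
yields K(K-1)/2 tasks. 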

VII. Classification 

The last step of speech emotion recognition is 
classification. It involves assigning the raw data, in the form 
of an utterance or a frame of the utterance, to a particular 
class of emotion on the basis of the features extracted from 
the data. Several types of conventional classifiers are 
available for research in this field, namely Hidden Markov 
Model (HMM), Gaussian Mixture Model (GMM), Support Vector 
Machine (SVM), Artificial Neural Network (ANN), etc. 

A. Multiple or Hierarchical Classifier Systems [30], [36], [37]. 

B. Hidden Markov Model [23]. 

C. Gaussian Mixture Model [24]. 

D. Support Vector Machine [25]. 

E. Artificial Neural Networks [26], [35]. 
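
As a minimal illustration of the classification step, the sketch 
below fits one diagonal-covariance Gaussian per emotion class and 
assigns a feature vector to the class with the highest posterior 
score. This is the simplest special case of a Gaussian mixture 
classifier (a single component per class) and is not the 
configuration of any cited system. All names, the `var_floor` 
parameter and the toy feature values (pitch in Hz, log energy) 
are assumptions. 

```python
import math
import statistics


def fit_gaussian_classifier(samples, labels, var_floor=1e-6):
    """Estimate per-class means, variances and priors from training data."""
    model = {}
    for c in set(labels):
        rows = [row for row, y in zip(samples, labels) if y == c]
        columns = list(zip(*rows))
        means = [statistics.fmean(col) for col in columns]
        variances = [max(statistics.pvariance(col), var_floor) for col in columns]
        model[c] = (means, variances, len(rows) / len(samples))
    return model


def log_likelihood(x, means, variances):
    """Log density of x under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, means, variances)
    )


def classify(model, x):
    """Return the class with the highest log prior plus log likelihood."""
    return max(
        model,
        key=lambda c: math.log(model[c][2]) + log_likelihood(x, model[c][0], model[c][1]),
    )


# Hypothetical training vectors: [mean pitch in Hz, log energy].
train_x = [[250.0, 8.0], [260.0, 8.5], [240.0, 8.2],
           [150.0, 5.0], [140.0, 5.5], [160.0, 5.2]]
train_y = ["angry", "angry", "angry", "sad", "sad", "sad"]
model = fit_gaussian_classifier(train_x, train_y)
prediction = classify(model, [245.0, 8.1])
```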

Some researchers have also proposed other types of 
classifiers, such as the 3DEC hierarchical classifier, in 
which the most highly confused classes are separated first 
from the rest of the classes, and the procedure is repeated 
until all the classes are separated. Another proposed strategy 
is optimum-path forest classification, in which the training 
set is treated as a graph whose nodes are samples and whose 
link weights are determined by the distance between the 
feature vectors of the corresponding nodes [29]. 
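
The graph underlying optimum-path forest classification can be 
sketched as a matrix of pairwise distances between training 
samples. The sketch covers only the graph construction described 
above, not the full classifier. The use of Euclidean distance and 
the function name are assumptions. 

```python
import math


def build_complete_graph(feature_vectors):
    """Return a symmetric matrix of link weights between all sample pairs."""
    n = len(feature_vectors)
    return [
        [math.dist(feature_vectors[i], feature_vectors[j]) for j in range(n)]
        for i in range(n)
    ]


# Three toy samples in a two-dimensional feature space.
weights = build_complete_graph([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)])
```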

VIII. Conclusion 

In this survey paper, the key aspects of a speech emotion 
recognition engine have been studied. The types of databases 
available, with the important characteristics of each, are 
described. Commonly used features for speech emotion 
recognition, and some new features proposed in recent studies 
that give better classification results, are also mentioned. 
Types of feature selection algorithms are discussed. A sample 
of the hierarchical classifier strategy is also provided for 
future work. 